The Observer Effect in Computer Networks

Tal Mizrahi, Technion – Israel Institute of Technology
Michael Schapira, Hebrew University of Jerusalem
Yoram Moses, Technion – Israel Institute of Technology

ABSTRACT

Network measurement involves an inherent tradeoff between accuracy and overhead; higher accuracy typically comes at the expense of greater measurement overhead (measurement frequency, number of probe packets, etc.). Capturing the “right” balance between these two desiderata – high accuracy and low overhead – is a key challenge. However, the manner in which accuracy and overhead are traded off is specific to the measurement method, rendering apples-to-apples comparisons difficult. To address this, we put forth a novel analytical framework for quantifying the accuracy-overhead tradeoff for network measurements. Our framework, inspired by the observer effect in modern physics, introduces the notion of a network observer factor, which formally captures the relation between measurement accuracy and overhead. Using our “network observer framework”, measurement methods for the same task can be characterized in terms of their network observer factors, allowing for apples-to-apples comparisons. We illustrate the usefulness of our approach by showing how it can be applied to various application domains and validate its conclusions through experimental evaluation.

The unexamined life is not worth living – Socrates

1 INTRODUCTION

Network measurement is critical for detecting failures and anomalies, and also for verifying that desired performance bars, e.g., Service Level Agreements (SLAs), are met. Measurements can be realized in a variety of ways, including injecting periodic or on-demand control (probe) packets, continuously streaming telemetry information to central analyzers, and in-band collection of measurement data.

Network measurement involves an inherent tradeoff between accuracy of the measured metric (e.g., network delays, throughput, etc.)
and the impact of measuring on network performance (e.g., the excess bandwidth consumed by probe packets, the increase in packet loss rate). Consider, for example, the challenge of guaranteeing that network delay within the premises of a service provider is below a certain value. To accomplish this, service providers typically rely on periodic delay measurements. For instance, a service provider that operates a network running MPLS-TP (Multi-Protocol Label Switching Transport Profile) can run periodic measurements using Delay Measurement Messages (DMM) [2]. Naturally, to obtain accurate and timely measurements, DMM messages should be sent at high frequency. However, too high a frequency (e.g., 300 messages per second) can entail prohibitively high communication overhead (and even affect network delay!). In this example, the measurement accuracy can be quantified in terms of how long it takes to detect that network delays exceed the threshold, whereas the performance overhead is the increase in data rate on the wire induced by sending DMM messages.

Footnote: This paper is an extended version of [1], published in the ACM Applied Networking Research Workshop (ANRW), 2024.

In this paper, we show that for a broad variety of measured metrics and notions of measurement overhead (including those of the above example), the accuracy-overhead tradeoff for any measurement method can be captured by a parameter that we call the network observer factor induced by this method. Informally, the network observer factor $\eta$ (inspired by the observer effect from modern physics) is a method-specific constant that quantifies the measurement overhead per time unit. The network observer factor provides a powerful conceptual framework for characterizing the efficiency of a measurement method, and can serve as a useful metric for comparing measurement methods for the same task.
Going back to the DMM example, suppose that the service provider is considering two possible delay measurement protocols for MPLS-TP with similar functionality, as in [2, 3]. As shall be explained below, by deriving the observer factors for the two protocols, the service provider can perform an ‘apples-to-apples’ comparison of the two protocols.

Our contributions are summarized below:

• We propose a formal framework for reasoning about the accuracy-overhead tradeoff of a measurement method (the “network observer effect”) and for characterizing measurement methods in terms of their “network observer factor”. We view this theoretical framework as a powerful lens for evaluating the efficiency of measurement methods and believe that complementing our approach with the prevalent experimental/empirical analysis can facilitate deeper insights into inherent tradeoffs between accuracy and overhead.

[arXiv:2406.09093v1 [cs.NI] 13 Jun 2024]
• We formally analyze the theoretical properties of the network observer effect and its associated factor, shedding light on the relation between performance measurement accuracy and measurement rate. In particular, we identify broad classes of measurement metrics and notions of measurement overhead for which the network observer factor can be analytically derived.

• We validate the usefulness of our approach, as well as our analytical observations, by experimentally analyzing several measurement protocols of interest, spanning passive, active, and in-band measurements. In particular, we show how, by fixing the level of measurement accuracy, our methodology facilitates apples-to-apples comparison of the efficiency of different measurement methods.

2 INSPIRATION AND OVERVIEW

2.1 The Observer Effect in Physics

Heisenberg’s uncertainty principle states that the position and momentum of a particle cannot both be measured precisely. This principle is often expressed by the following uncertainty relation [4]:

    $\Delta x \cdot \Delta p \geq \hbar$    (1)

where $\Delta x$ is the level of uncertainty regarding a particle’s position, $\Delta p$ is the uncertainty regarding its momentum, and $\hbar$ is the reduced Planck constant.

Heisenberg argued [4] that a gamma ray that is used to measure the particle’s location will affect the particle’s momentum. In other words, the measurement procedure affects the measured system, a phenomenon that later became known as the observer effect. Heisenberg’s uncertainty principle has been found to be valid even if the system is not measured by an observer, and thus the uncertainty principle and the observer effect are regarded as two distinct principles of quantum mechanics. Notably, the uncertainty relation (Eq. 1) is applicable to the analysis of both principles.

2.2 The Network Observer Effect

Communication networks can also be regarded as exhibiting an observer effect, in the sense that the act of measuring the performance of a network often affects network performance.
Inspired by Heisenberg’s analysis of the uncertainty relation (Eq. 1), we introduce an analogous uncertainty relation.

The uncertainty relation in networks:

    $\Delta M \cdot \Delta P \geq \eta$    (2)

In the above equation, $\Delta M$ is the uncertainty in the measured metric $M$ and $\Delta P$ is the change in a network performance metric, $P$, which is affected by the measurement. Intuitively, $\Delta M$ is the measurement’s uncertainty (or accuracy), whereas $\Delta P$ is the measurement’s impact, which quantifies the effect of the measurement on network performance. $\Delta P$ can have different meanings in different contexts, e.g., excess packet loss or delay induced by measurements, or excess data transfer charges induced by measurement probes.

Naturally, lower uncertainty comes at the cost of higher measurement overhead. Going back to our DMM example, suppose that our goal is to minimize the time it takes for the node that runs the periodic DMM to detect unusual congestion or failure. To reduce measurement uncertainty in terms of the time it takes to detect anomalous network behavior ($\Delta M$), the frequency of the DMM messages should be increased. This, however, would impact the performance by increasing the data rate on the wire (with an excess data rate of $\Delta P$), as the uncertainty relation suggests.

Like Heisenberg’s uncertainty relation, the network uncertainty relation clearly does not apply to all possible performance metrics and all possible measurement methods. In Section 4 we characterize the conditions under which the uncertainty relation is applicable.

Eq. 2 captures the natural tradeoff between measurement accuracy and overhead; decreasing the uncertainty in performance measurement causes the impact on the network performance to increase. This tradeoff is illustrated in Fig. 1, which presents the impact as a function of the measurement uncertainty in a specific scenario. The curve represents the theoretical lower bound of the measurement impact according to Eq. 2.
Our experimental evaluation (Section 5) includes experiments that show that the measured impact nearly coincides with the theoretical lower bound of the uncertainty relation.

[Figure 1. Axes: impact $\Delta P$ versus uncertainty $\Delta M$.]
Figure 1: The impact as a function of the measurement uncertainty. The curve represents the lower bound, where $\Delta M \cdot \Delta P = \eta$.
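The lower-bound curve of Fig. 1 can be sketched numerically. The following is a minimal illustration rather than the authors' code: it assumes ΔM is expressed in seconds and η in bits, and the function name and the value of η are hypothetical.

```python
def impact_lower_bound(eta: float, uncertainty: float) -> float:
    """Smallest impact ΔP permitted by ΔM · ΔP >= η for a given uncertainty ΔM."""
    if eta < 0 or uncertainty <= 0:
        raise ValueError("eta must be non-negative and uncertainty positive")
    return eta / uncertainty


if __name__ == "__main__":
    ETA = 1000.0  # hypothetical observer factor (assumed unit: bits)
    for delta_m in (0.5, 1.0, 2.0, 4.0):  # hypothetical uncertainties (assumed unit: seconds)
        bound = impact_lower_bound(ETA, delta_m)
        print(f"uncertainty {delta_m:4.1f} s -> impact >= {bound:7.1f} bps")
```

Under these assumptions, halving the uncertainty doubles the minimal impact, which is the hyperbolic shape of the curve.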
$\eta$ is a constant we call the network observer factor, which quantifies the measurement overhead per time unit. Importantly, unlike the global Planck constant, $\eta$ is specific to the measurement method. $\eta$ can be used as a metric for the efficiency of a measurement method. Indeed, we contrast different measurement methods in terms of their associated $\eta$ constants to gain insight into the relation between measurement accuracy and overhead. In our experimental evaluation (see Section 5), we validate our theoretical findings and demonstrate the usefulness of our approach. We accomplish this by evaluating the considered measurement methods under similar uncertainty in measurement accuracy ($\Delta M$) and observing the impact on network performance induced by each method ($\Delta P$), allowing for an apples-to-apples comparison. Interestingly, our theoretical analysis and evaluation show that, in the context of network delay measurements, in-situ methods such as In-band Network Telemetry (INT), which seemingly require high overhead, have low uncertainty, and thus have almost the same impact on performance as other measurement methods (e.g., passive measurements) under the same level of achieved measurement accuracy.

3 WHY THE OBSERVER EFFECT MATTERS

The network observer effect, as presented above, applies to various types of networks: data centers, wide area and carrier networks, campus networks, and even home networks. We argued that a measurement method that has low uncertainty has a high performance impact on the network. One could argue that the fast growth of large-scale networks has opened the door to overprovisioned network resources (e.g., [5]), where the overhead of network measurement may arguably be negligible, or the measurement accuracy could be relaxed. We present a few crisp use cases that demonstrate the observer effect in high-speed networks.
The use cases demonstrate why highly accurate (detailed) measurement is important, and why overprovisioning does not necessarily avoid the measurement impact.

3.1 Use Case 1: In-situ Measurement

In-situ measurement (also known as in-band or in-network telemetry) provides fine-grained and detailed monitoring and measurement by having every switch or router push telemetry information into the header of every data packet. Both research-driven and industry-driven protocols have been defined in this context (e.g., [6–9]). In-situ measurement provides very detailed information about the path taken by every packet, the performance of switches along the path, and potential failures. However, the obvious penalty is the per-packet overhead, which may cost tens of bytes per packet. In-situ measurement is an important example: despite the high measurement overhead, this approach is increasingly gaining attention from the networking community due to the fine-grained information it provides, and there is at least one publicly known deployment [10].

3.2 Use Case 2: Broadband Home Access

The cost structure of home broadband subscriptions is bandwidth-sensitive, and therefore allocating a fraction of the bandwidth to measurement and monitoring would be frowned upon by home subscribers. The overprovisioning approach that has become common in public cloud networks [5] is obviously not feasible in home networks. Therefore, an accurate measurement is likely to have a noticeable impact on the performance of a home subscriber network.

3.3 Use Case 3: Lossless Service

Consider a data center network in which storage-related traffic is forwarded as a lossless service. Lossless delivery is guaranteed by assigning a dedicated traffic class to this service, and by using Priority Flow Control (PFC).

The network operator measures the lossless service, as even infrequent packet losses should be monitored and troubleshot.
The operator may occasionally encounter a few packet drops, caused by short micro-bursts that temporarily fill the queues. The measurement should be accurate and fine-grained, allowing detection of even a small number of drops. At the same time, the measurement overhead becomes significant when the queues are full or nearly full, since it then causes more packet drops than would occur if the flow had not been measured. As shown in Fig. 2, when one of the links reaches its capacity, even a small increase in the traffic rate will affect the number of packet drops. This use case demonstrates why even the slightest measurement impact may in some cases be noticeable in the network we aim to measure.

[Figure 2. Axes: packet loss [%] (0 to 50) versus user traffic rate [bps] (0 to 1,000,000). Series: unobserved and observed; the link capacity is marked.]
Figure 2: Experimental example of the network observer effect: an observed flow (monitored by IOAM [6]) has a higher packet loss rate than an unobserved flow. Fixed user traffic rate.
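The effect shown in Fig. 2 can be illustrated with a simple fluid model. This is a sketch under stated assumptions, not the experiment behind the figure: the link drops exactly the traffic that exceeds its capacity, the loss is computed over all offered traffic (user plus measurement), and every number below is hypothetical.

```python
def loss_percent(user_bps: float, overhead_bps: float, capacity_bps: float) -> float:
    """Percentage of offered traffic dropped by a link of the given capacity.

    Assumed fluid model: anything above capacity is dropped, nothing else is.
    """
    if capacity_bps <= 0:
        raise ValueError("capacity must be positive")
    offered = user_bps + overhead_bps
    if offered <= capacity_bps:
        return 0.0
    return 100.0 * (offered - capacity_bps) / offered


if __name__ == "__main__":
    CAPACITY = 800_000.0   # hypothetical link capacity [bps]
    OVERHEAD = 200_000.0   # hypothetical measurement overhead [bps]
    for user in (400_000.0, 600_000.0, 800_000.0):
        unobserved = loss_percent(user, 0.0, CAPACITY)
        observed = loss_percent(user, OVERHEAD, CAPACITY)
        print(f"user={user:9.0f} bps unobserved={unobserved:5.1f}% observed={observed:5.1f}%")
```

In this model the observed flow starts losing traffic at a lower user rate than the unobserved one, because the measurement overhead pushes the offered load past the link capacity sooner.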
3.4 Use Case 4: Service-Level Agreement (SLA) Measurement

A Service-Level Agreement (SLA) in carrier networks is an agreement between a customer and a service provider regarding the performance of a network service. As defined in the MEF 10.3 specification [11], a Service Level Specification (SLS) defines the performance objectives for a given bandwidth. That is, the SLS specifies the agreed delay, delay variation, and loss rate that are guaranteed as long as the customer does not exceed an agreed bandwidth. If a customer is interested in measuring the network, in order to guarantee that the SLS is satisfied, the measurement overhead effectively decreases the user’s bandwidth compared to the bandwidth that is specified in the SLS.

4 ANALYZING THE OBSERVER EFFECT

We now investigate the network uncertainty relation, which is inspired by Heisenberg’s uncertainty relation and its connection to the observer effect. We first present the terminology, assumptions, and the model used in the analysis.

4.1 Measurement Classes

We analyze the measurement granularity in terms of three main measurement classes [12]: passive, active, and hybrid measurement.

Passive Measurement. Passive methods observe network traffic without modifying it and without transmitting control packets along the data path. Notably, since measurements are performed locally by each network node, measurement data is typically exported to a central aggregator. Measurement data may be exported periodically, on-demand, or triggered by specific events. A common protocol used for exporting passive measurement results is gNMI [13], which can be used by network devices to stream telemetry information to central collectors.

Active Measurement. Active methods use synthetic traffic to measure the network. Ping and Traceroute are common examples.
Home network speed tests (e.g., [14]) perform the measurement by temporarily running synthetic traffic at a high rate, while other methods use periodic control messages, such as Continuity Check Messages (CCM) [15, 16], to continuously monitor the network. To analyze the overhead of active measurement we focus on periodic measurement, which continuously monitors the traffic. This allows a problem to be detected as soon as it occurs, but obviously comes at the cost of continuous overhead.

In-situ Measurement. In-situ (or in-band) methods, such as In-band Network Telemetry (INT) [9] and In situ OAM (IOAM) [6], piggyback measurement data onto live user packets (footnote 1). This data is peeled off at a decapsulation node, which also exports some (or all) of the measurement data to a central collector.

Note that in-situ measurement is one example of the hybrid methods defined in [12]. Other hybrid methods are also defined there, but strictly from a measurement overhead perspective, which is the focus of the current paper, each of these other methods belongs to one of the three classes above.

Notably, we should distinguish between data path overhead and management overhead. Data path overhead is overhead that is induced along the path of the user traffic that is being measured. In contrast, management overhead is required for exporting information to a central collector; the export path may or may not coincide with the data path. Active and in-situ measurements incur data path overhead, since for these methods the overhead occurs along the same path as the user traffic. Furthermore, these two methods typically also entail management overhead, derived from allowing the measurement information to be exported to a collector. Passive measurement entails only management overhead.

In each of the measurement classes, the overhead is quantified in a different way. Active measurement uses periodic control messages, and thus the overhead is on a per-time-unit basis.
The overhead of in-situ measurement is quantified on a per-packet basis. Passive measurement requires only management overhead, which we also consider on a per-time-unit basis, as summarized in Table 1.

                        Passive         Active          In-situ
Data path overhead                      ✓               ✓
Management overhead     ✓               ✓               ✓
Overhead granularity    per time unit   per time unit   per packet

Table 1: Measurement classes.

Footnote 1: Note that Direct Exporting [17] or Postcard Mode [18] are forms of passive measurement in our context, as telemetry data is not forwarded along the data path.

4.2 Metrics

As discussed in Section 1, there is a tight coupling between the desired measurement granularity and the impact of the observation on the measurement. A key question is which metrics, i.e., the $M$ and $P$ in Eq. 2, are amenable to our type of analysis.
The uncertainty relation is clearly not applicable to all possible performance metrics. In our analysis the measured metric $M$ is specifically a sensitive performance metric, as defined in the following subsection. The impacted performance metric $P$ is a rate metric, i.e., a metric that is affected by the data rate or loss rate. Our analysis focuses on metrics that are impacted by the measurement regardless of the network utilization. For example, if $\Delta P$ represents the impact of the measurement on the traffic rate in the measured path or link, this impact is inherent in the measurement, regardless of how utilized the network is. Consequently, the analysis of the observer effect is not affected by overprovisioning. We prove below that the uncertainty relation is applicable in this context.

4.3 Model and Definitions

Our analysis assumes that measurements are performed for a specific traffic flow or set of flows, where a flow consists of a set of packets with common characteristics, such as 5-tuple properties. The data rate of a traffic flow in a network is defined to be the number of bits per time unit that are successfully delivered from the source to the destination, including the data plane headers and any overhead that may be incurred by the performance measurement. The loss rate of a flow is defined to be the number of bits per time unit that are sent by the source and not delivered to the destination. It is also assumed that the data rate of the analyzed flow(s) is constant (footnote 2). Table 2 summarizes the notation used in this section.

$M$                   A sensitive performance metric that is measured periodically.
$\tau$                The measurement period.
$\Delta M(\tau)$      The amount of uncertainty in $M$ when measured periodically with a period $\tau$.
$P$                   A rate metric.
$\Delta P_M(\tau)$    The measurement impact on $P$ when $M$ is measured periodically with a period $\tau$.
$\Theta$              The measurement overhead, measured in bits per time unit.

Table 2: Notations

Footnote 2: This assumption simplifies the analysis of the uncertainty relation. In practice, the data rate is not necessarily constant, but the analysis can be performed in sufficiently short time intervals, in which the data rate is roughly constant.

A measurement in our context is a process that observes a traffic flow using a fixed measurement method, which requires an overhead of $\Theta$ bits per time unit. The overhead $\Theta$ and the amount of uncertainty in the performance metrics $M$ and $P$ depend on the specific measurement method. As discussed in Section 4.1, in different measurement classes the overhead may have a different impact on $P$, which may in turn affect the data path or the management path (or both).

Definition 4.1 (Sensitive performance metric). A performance metric is said to be sensitive if the amount of uncertainty when the metric is measured periodically with a period $\tau$ can be represented by a function $\Delta M(\tau)$ such that $\Delta M(\tau) = C_M \cdot \tau$ for some constant $C_M$.

Definition 4.2 (Uncertainty in a measured metric). Let $M$ be a performance metric that is measured periodically with a period $\tau$, and $M(t)$ be the measured value at time $t$. The uncertainty $\Delta M(\tau)$ is the minimal value for which $|M(t') - M(t)| \leq \Delta M(\tau)$ for all $t'$ such that $t < t' \leq t + \tau$.

Definition 4.3 (Rate metric). A metric that either quantifies the traffic rate or the traffic loss rate is called a rate metric.

Definition 4.4 (Measurement impact). We define the measurement impact $\Delta P_M(\tau)$ of a rate metric $P$ when a sensitive metric $M$ is measured with a period $\tau$ to be the minimal upper bound on the difference between the value of the metric $P$ when $M$ is measured and the value of the metric $P$ without the measurement.

4.4 The Network Uncertainty Relation

Our analysis starts with a basic claim (Lemma 4.5) about the connection between the measurement impact $\Delta P_M(\tau)$ and the measurement overhead $\Theta$.

LEMMA 4.5.
If a metric $M$ is measured periodically with a period $\tau$ for a given flow, the mean measurement overhead per time unit is $\Theta$, and $P$ is a rate metric of the flow, then the impact $\Delta P_M(\tau)$ satisfies $\Delta P_M(\tau) \geq \Theta$.

PROOF. A rate metric $P$ may be either a loss rate metric or a data rate metric. For the given measured metric $M$ and period $\tau$, we denote the loss rate uncertainty by $\Delta L_M(\tau)$ and the data rate uncertainty by $\Delta D_M(\tau)$. The loss rate uncertainty results from the measurement overhead, i.e., the highest loss rate occurs when the overhead exceeds the flow bandwidth, causing a loss rate of $\Theta$. Thus the difference between the maximal loss and the minimal loss satisfies $\Delta L_M(\tau) \geq \Theta$. The data rate uncertainty results from the fact that measurement (overhead) traffic may either be lost or not, depending on the network utilization (or overprovisioning), and therefore $\Delta D_M(\tau) \geq \Theta$. □

The following theorem is the networking variant of Heisenberg’s uncertainty relation. Intuitively, the theorem captures the tradeoff between the uncertainty in a measured performance metric $M$ and its effect on a corresponding rate metric $P$.

THEOREM 4.6. If $M$ is a sensitive performance metric of a flow that is measured periodically with a period $\tau$, and $P$ is a rate metric of the flow, then there exists a constant $\eta$ such that:

    $\Delta M(\tau) \cdot \Delta P_M(\tau) \geq \eta$    (3)

PROOF. Since $M$ is a sensitive performance metric, by the definition of a sensitive metric there exists a constant $C_M$ such that $\Delta M(\tau) = C_M \cdot \tau$. Therefore $\Delta M(\tau) \cdot \Delta P_M(\tau) = C_M \cdot \tau \cdot \Delta P_M(\tau)$. By Lemma 4.5, $\Delta P_M(\tau) \geq \Theta$, and thus $C_M \cdot \tau \cdot \Delta P_M(\tau) \geq C_M \cdot \tau \cdot \Theta$. Since a fixed measurement method was assumed, with an overhead of $\Theta$ bits per time unit, it follows that $\tau \cdot \Theta$ is constant; we denote this constant by $C_{\tau\Theta}$. Thus, $C_M \cdot \tau \cdot \Theta = C_M \cdot C_{\tau\Theta}$. We define $\eta$ to be $C_M \cdot C_{\tau\Theta}$, and thus obtain $\Delta M(\tau) \cdot \Delta P_M(\tau) \geq \eta$. □

4.5 Understanding the Network Uncertainty Relation

In order to understand the impact of Theorem 4.6 we present three examples, in the context of the three measurement classes that were discussed in Section 4.1.

Example 1. Continuity Check Messages (CCM) are used in Ethernet OAM [15, 16] in order to verify the continuity of an Ethernet link and to detect failures. CCMs are sent periodically, and a failure is reported when a CCM has not arrived within a given timeout. If we define the measured metric $M$ to be the detection time of a failure, then the frequency of the CCMs determines the uncertainty in the detection time $M$. The tradeoff between the uncertainty in the detection time and the impact of the measurement on the rate is captured by Theorem 4.6. In this case $\Delta M(\tau) = \tau$, and it is easy to see that $\eta$ is equal to the number of overhead bits per period $\tau$.

Example 2. Consider a passive measurement process, in which a performance metric $M$ such as a flow counter is measured by a network device and periodically exported to a collector node. The measurement period $\tau$ is an indication of the freshness of the information that the collector obtains from the measured device; the data is fresh immediately after the measurement, but any change in the measured metric after the measurement is only known to the collector upon receiving the next measurement.
Thus, $\Delta M(\tau)$ represents the maximal difference between two consecutive measurement values of $M$, and is directly proportional to $\tau$. Following Theorem 4.6, any reduction in the uncertainty of $M$ will result in an increase in the impact on the loss and/or data rate.

Example 3. Consider a flow in which IOAM is used for monitoring the network path of the flow. A collector is used to monitor the IOAM data, and detect when the network path changes. The metric $M$ in this example refers to the number of packets that were forwarded in the flow before a path change occurred. As IOAM may be applied to all packets of the flow or to a subset of the packets, we can define $\tau$ to be the mean period between two packets that are monitored by IOAM. Thus, by Theorem 4.6, there is a tradeoff between the metric $M$ and the impact on the flow rate.

4.6 A Concrete Observer Factor

The network uncertainty relation (Eq. 3) uses the constant $\eta$, representing the tradeoff between the measurement’s uncertainty and impact. This constant is specific to the measurement method, in contrast to $\hbar$ in Eq. 1, which is a universal constant. In Example 1 the factor $\eta$ has an intuitive meaning and can be easily computed; it is simply the number of overhead bits per measurement period. The following lemma reflects this point. As in Example 1, we analyze the uncertainty in the detection time, $T$, of a failure or anomaly.

LEMMA 4.7. Let $T(\tau)$ be the detection time in a periodic measurement with a period $\tau$, and let $P$ be a rate metric. Then $\Delta T(\tau) \cdot \Delta P_T(\tau) \geq \eta$, where $\eta$ is the number of overhead bits per period $\tau$ used by the measurement.

PROOF. Since the measurement is periodic with a period $\tau$, a failure or anomaly is detected at most $\tau$ time units after it occurs, and thus $\Delta T(\tau) = \tau$. By Lemma 4.5 we have that $\Delta P_T(\tau) \geq \Theta$. Thus, $\Delta T(\tau) \cdot \Delta P_T(\tau) \geq \tau \cdot \Theta$. Note that $\tau \cdot \Theta$ is the number of overhead bits per period, and we denote it by $\eta$, yielding $\Delta T(\tau) \cdot \Delta P_T(\tau) \geq \eta$.
□

4.7 Scaling the Observer Effect

An important question that arises with respect to the observer effect is whether it is still relevant at high scales. For example, one may argue that the impact of a periodic active measurement method, such as the CCM protocol, may be significant in low-bandwidth networks, but becomes insignificant in large-scale networks. We will show that, in a precise sense, the measurement impact scales with the size of the network.

At first glance it may seem that the desired uncertainty (or detection time), $\Delta T(\tau)$, does not scale with the network size. The best-known requirement in telecom networks is the ability to recover from a failure within 50 milliseconds. This requirement has not changed in many years, as it is derived from the human ability to detect downtimes in voice calls. However, in large-scale data center networks much faster detection times may be required. For example, if we consider a high-bandwidth transaction between servers over a 100 Gbps network interface, a 50 millisecond detection time corresponds to 5 Gigabits of data, which may be lost before an error is detected. Thus, the measurement uncertainty requirements become increasingly stringent as networks scale. Moreover, even if the desired uncertainty is a fixed requirement that is determined by the application (as in the 50 millisecond example), we expect the measurement overhead and impact to scale with the number of flows.
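Both scaling arguments above reduce to short arithmetic, sketched below. The function names are hypothetical; the 100 Gbps and 50 ms figures come from the text, while the per-period overhead (a 101-byte message, the CCM length reported in the evaluation) and the number of flows are illustrative assumptions.

```python
def data_at_risk_bits(link_rate_bps: float, detection_time_s: float) -> float:
    """Bits that may be sent, and lost, before a failure is detected."""
    return link_rate_bps * detection_time_s


def aggregate_overhead_bps(num_flows: int, overhead_bits_per_period: float,
                           period_s: float) -> float:
    """Total measurement overhead when each of num_flows is measured every period_s."""
    if period_s <= 0:
        raise ValueError("period must be positive")
    return num_flows * overhead_bits_per_period / period_s


if __name__ == "__main__":
    # 100 Gbps interface with a 50 ms detection time, as in the text.
    print(f"data at risk: {data_at_risk_bits(100e9, 0.05) / 1e9:.1f} Gb")

    # Assumed: 101-byte message per period, 50 ms period, hypothetical flow counts.
    for flows in (1, 100, 10_000):
        total = aggregate_overhead_bps(flows, 101 * 8, 0.05)
        print(f"{flows:6d} flows -> {total / 1e6:10.3f} Mbps of overhead")
```

With the detection time held fixed, the aggregate overhead grows linearly with the number of measured flows.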
[Figure 3. Three panels plotting impact [bps] (log scale, 1E+00 to 1E+06) versus uncertainty [sec] (0 to 10), each with a measured and a theoretical series: (a) Passive measurement and exporting using gNMI. (b) Active measurement using CCM. (c) In situ measurement using IOAM.]
Figure 3: The uncertainty relation (Theorem 4.6) in practice: the measurement impact vs. the measurement uncertainty. For each of the three measurement classes the experimental result is compared to the theoretical result (predicted by the uncertainty relation).

This intuition is formalized in the following lemma:

LEMMA 4.8. Let $T(\tau)$ be the detection time in a periodic measurement with a period $\tau$ that is performed for $N$ flows, and let $P$ be a rate metric. Then $\Delta T(\tau) \cdot \sum_{i=0}^{N-1} \Delta P_{T_i}(\tau) \geq N \cdot \eta$, where $\Delta P_{T_i}(\tau)$ is the impact for flow $i$ and $\eta$ is the number of overhead bits per period $\tau$.

PROOF. We observe that $\Delta T(\tau) \cdot \sum_{i=0}^{N-1} \Delta P_{T_i}(\tau) = \sum_{i=0}^{N-1} \Delta T(\tau) \cdot \Delta P_{T_i}(\tau)$. By Lemma 4.7 we obtain that $\sum_{i=0}^{N-1} \Delta T(\tau) \cdot \Delta P_{T_i}(\tau) \geq \sum_{i=0}^{N-1} \eta = N \cdot \eta$. It follows that $\Delta T(\tau) \cdot \sum_{i=0}^{N-1} \Delta P_{T_i}(\tau) \geq N \cdot \eta$. □

This lemma provides an important insight about the scaling of the observer effect: for a fixed detection time, the impact scales with the size of the network.

5 EVALUATION

We now present an experimental evaluation of the network observer effect for each of the three measurement classes of Section 4: passive, active and in-situ. We evaluate three significantly different protocols, gNMI, CCM and IOAM, in three different open source environments. All of our evaluation results can be replicated using the detailed instructions and code in [19].
To allow for an ‘apples-to-apples’ comparison, despite the inherent differences between the three classes (Section 4.1), we assume that the measurement overhead is carried over the data path (footnote 3). Specifically, in the IOAM measurement we assume that data flows have a constant data rate of 1 Mbps, using a packet length of 360 bytes (footnote 4). This assumption is specifically relevant to IOAM, where measurement overhead is a function of the data rate.

Footnote 3: Specifically, in passive measurement this refers to the case in which the measurement data is exported along the same path as the data itself. Although our analysis does not mandate this assumption, it is convenient for an apples-to-apples comparison.

5.1 Experiment Setup

Three experimental environments were used in the evaluation:

gNMI. The passive measurement experiment setup was based on an open source environment running a Stratum/BMv2 switch [21], emulated in Mininet. Two hosts were connected through the switch, and a counter was used to monitor the traffic through each of the switch’s interfaces. gNMI [13] was used for periodically exporting the value of one of the counters from the switch; we used the gNMI client to subscribe to periodic updates from the switch, and varied the export interval from a few milliseconds to a few seconds. The impact (management overhead) of this telemetry stream was analyzed.

CCM. The active measurement setup was based on an open source implementation [22] of the IEEE 802.1ag standard [15] for Ethernet Connectivity Fault Management. The active measurement procedure was performed by periodic Continuity Check Messages (CCM) that were sent between two hosts in a Mininet environment. We evaluated the measurement impact for various CCM interval values.

IOAM. We tested in-situ measurement using an open source implementation of IOAM in IPv6 [6, 23].
Traffic was sent between two hosts and forwarded along three hops of switches: the first switch pushed an IPv6 tunnel with the IOAM encapsulation, the second was used as an IOAM transit switch and pushed IOAM data into transit packets, and the third pushed its own IOAM data and then removed the IPv6 tunnel and IOAM encapsulation. The two hosts and three switches were emulated by five Linux Containers (LXC).

⁴ This is the average packet length in a simple IMIX [20], including IPv6 and L2 headers.

5.2 The Impact-Uncertainty Tradeoff

In each of the three experiments we focused on the data rate as the metric $P$. Assuming an overprovisioned network, the impact $\Delta P_M(\tau)$ is the difference between the data rate including the measurement and the data rate without the measurement. Thus, the impact is equal to the measurement overhead (this statement is confirmed by the evaluation in the next subsection). We define $\Delta M(\tau)$ to be the uncertainty in the detection time of a failure or anomaly, and by Lemma 4.7 we have that $\Delta M(\tau) = \tau$, and the constant $\eta$ is equal to the number of overhead bits per period $\tau$.

We ran the experiments with various values of $\tau$. In gNMI the period $\tau$ is determined by the exporting interval, and similarly the CCM interval determines $\tau$ in the active case. In IOAM we varied the sampling ratio, which is the fraction of data traffic that is monitored by IOAM. Since the data rate in the experiment is constant, the sampling ratio determines the measurement interval $\tau$.

The experimental results are presented in Fig. 3. In each measurement setup we measured the impact $\Delta P_M(\tau)$ as a function of the uncertainty $\Delta M(\tau)$. Each graph also depicts the theoretical curve $\Delta P_M(\tau) = \eta / \Delta M(\tau)$, illustrating the uncertainty relation (Eq. 3). The theoretical value of $\eta$ is the expected overhead per period of each of the protocols. In gNMI, a telemetry message of $\eta^{(\mathrm{gNMI})} = 204$ bytes was exported by the switch in each period. Each CCM is $\eta^{(\mathrm{CCM})} = 101$ bytes long. IOAM was used over a three-hop network, where each hop pushed an overhead of 8 bytes, with a total of $\eta^{(\mathrm{IOAM})} = 80$ bytes including the IPv6 tunnel header (44 bytes), the IPv6 option header (4 bytes) and the IOAM header (8 bytes), which were added by the encapsulating switch.

The results confirm the uncertainty relation of Theorem 4.6. In the experiments, Eq. 3 was in fact an equality.
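The theoretical curves of Fig. 3 can be sketched directly from the per-period overheads quoted above. The following is a minimal sketch, not the evaluation code of [19]. It assumes that Eq. 3 holds with equality, so that the impact is $\eta / \Delta M(\tau)$ with $\Delta M(\tau) = \tau$. The function names are hypothetical. For IOAM the sketch further assumes that the sampling ratio is expressed as one monitored packet out of every `one_in_n` data packets, and that the 360-byte packet length refers to the data packet before the measurement overhead is added.

```python
BITS_PER_BYTE = 8

# Overhead per period in bytes, as reported in Section 5.2.
ETA_BYTES = {"gNMI": 204, "CCM": 101, "IOAM": 80}


def theoretical_impact_bps(eta_bytes: float, uncertainty_sec: float) -> float:
    """Impact predicted by the uncertainty relation when it holds with
    equality: eta (in bits) divided by the uncertainty (in seconds)."""
    if uncertainty_sec <= 0:
        raise ValueError("uncertainty_sec must be positive")
    return eta_bytes * BITS_PER_BYTE / uncertainty_sec


def ioam_period_sec(one_in_n: int, data_rate_bps: float = 1_000_000,
                    packet_bytes: int = 360) -> float:
    """Interval between monitored packets at a constant data rate.

    Assumption: one of every `one_in_n` data packets carries IOAM data.
    """
    if one_in_n < 1:
        raise ValueError("one_in_n must be at least 1")
    packet_time_sec = packet_bytes * BITS_PER_BYTE / data_rate_bps
    return one_in_n * packet_time_sec


# Theoretical impact of each protocol for a few measurement periods.
for protocol, eta in ETA_BYTES.items():
    for tau in (0.01, 0.1, 1.0, 10.0):
        impact = theoretical_impact_bps(eta, tau)
        print(f"{protocol:>4}  tau = {tau:>5} s  impact = {impact:>10.1f} bps")

# IOAM: the sampling ratio determines tau, which determines the impact.
for n in (1, 10, 100):
    tau = ioam_period_sec(n)
    impact = theoretical_impact_bps(ETA_BYTES["IOAM"], tau)
    print(f"IOAM  1-in-{n:<3}  tau = {tau:.5f} s  impact = {impact:.1f} bps")
```

Under these assumptions, monitoring every packet of a 1 Mbps flow gives a period of 2.88 ms and an IOAM impact of roughly 222 kbps, and each tenfold increase of the period reduces the impact tenfold.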
As mentioned in Section 4, it is expected that in some cases the uncertainty $\Delta M(\tau)$ will be greater (satisfying the ‘≥’ in the equation) due to other factors that are not related to the measurement overhead.

5.3 Evaluating the Measurement Impact

In this experiment we evaluated the measurement impact $\Delta P_M(\tau)$ as a function of the computed measurement overhead $\Theta$. We used the IOAM setup, and varied the overhead per time unit by changing the IOAM sampling ratio. Two scenarios were tested: (a) the network is overprovisioned and the measurement impacts the data rate without causing packet loss, and (b) a 1 Mbps flow is forwarded over a 1 Mbps link without overprovisioning, thus causing packet loss.

[Figure 4 panels, each plotting the data rate impact and the loss rate impact against Overhead [kbps]: (a) The measurement impact vs. the measurement overhead in an overprovisioned link. (b) The measurement impact vs. the measurement overhead without overprovisioning.]

Figure 4: The correlation between impact and overhead.

The experimental results of Fig. 4 confirm the connection between the measurement impact and the overhead, stated in Lemma 4.5. Notably, our definition of impact captures the cost of the measurement whether the network is overprovisioned or not; in the overprovisioned case the measurement impacts the data rate, whereas without overprovisioning the measurement impacts the loss rate.

5.4 Evaluating the Observer Effect Scaling

In order to evaluate the observer effect at large scales we used a simulation environment. The simulation was implemented in Visual Basic, and its purpose was to evaluate the uncertainty relation and specifically the measurement impact in a network with a high data rate, and with a large number of flows.
The advantage of a software-based simulation is that it allows analysis of large scales, as opposed to the emulation environments of the previous subsections, which ran in a virtualized environment on a conventional PC, and therefore support only up to tens of thousands of packets per second.

The simulation environment mimicked the three scenarios of Section 5.1, but at a larger scale. The CCM scenario was simulated with a variable number of flows, between 1 and 100,000,⁵ so that each flow was independently monitored by the CCM protocol. For each scale the simulation was repeated with different measurement period values: 3.33 ms, 100 ms, 1 sec, and 10 sec. The gNMI scenario was similarly simulated for a variable number of flows and with various values of measurement periods. The in-situ measurement was simulated at various data rates⁶ from 1 Mbps to 100 Gbps, and the sampling ratio was varied from 1 to 100.

⁵ Supporting tens of thousands of monitored flows on a single device is a reasonable use case in carrier networks [24, 25], in which the CCM is used.

⁶ The scaling factor of in-situ measurement is not the number of flows, since in-situ measurement uses per-packet overhead (subject to the sampling ratio). Instead, the scaling factor here was the data rate of the monitored data flow.

[Figure 5 panels (log-log scale): (a) The impact [bps] as a function of the number of flows (simulated). Each curve represents a different measurement period (3.33 ms, 100 ms, 1 sec, 10 sec). (b) The observer factor [bps] as a function of the number of flows (computed). (c) The impact [bps] as a function of the observer factor [bps] (simulated). Each curve represents a different measurement period (3.33 ms, 100 ms, 1 sec, 10 sec).]

Figure 5: The scaling of the observer effect as a function of the number of flows. Simulated for active measurement (CCM).

Fig. 5 presents simulation results that demonstrate the scaling behavior of Lemma 4.8. These results were produced from the active measurement (CCM) scenario. Using the notations of Lemma 4.8, Fig. 5a depicts the impact, which is $\sum_{i=0}^{N-1} \Delta P_{T_i}(\tau)$, as a function of the number of flows $N$. Similar simulation runs were performed in the gNMI scenario and in the in-situ scenario, producing similar scaling behavior, as shown in Fig. 6.

Fig. 5b illustrates the computed observer factor in the CCM scenario, $N \cdot \eta$, as a function of the number of flows $N$. A comparison to Fig. 5a demonstrates the correlation between the (simulated) impact and the (computed) observer factor. This correlation is then confirmed in Fig. 5c, which shows the impact as a function of the observer factor, demonstrating the uncertainty relation of Lemma 4.8.

These simulation results confirm the scaling behavior of the observer effect: the measurement impact scales with the size of the network. Specifically, the simulations confirm the measurement impact scaling that lies in the uncertainty relation of Lemma 4.8.

6 BEYOND THE OBSERVER EFFECT

In this paper we analyzed the impact of measurement overheads on the network performance.
However, measuring and observing the network does not only add overhead traffic. It may also, in some cases, consume processing power in networking devices including hosts, switches and routers. This processing overhead may affect the data plane processing, the control plane processing, or both. In either case, this may impact the network performance. Moreover, the measurement overhead affects not only the network resources, but also the computing and storage resources of the hosts that take part in the measurement protocols and the server(s) that monitor and analyze the performance.

[Figure 6 panels (log-log scale): (a) Passive measurement impact as a function of the number of flows (simulated), for measurement periods of 3.33 ms, 100 ms and 1 sec. (b) In-situ measurement impact as a function of the data rate [bps] (simulated), for sampling ratios of 1, 10 and 100.]

Figure 6: Scaling of the measurement impact (simulated) in passive and in in-situ measurement. Each curve represents a different measurement period (a) or sampling ratio (b).

Another aspect that should be considered is fate-sharing: the traffic that is used for measuring the network must follow the same path and forwarding policy as the measured user traffic. However, in-situ measurement may increase the size of monitored packets, possibly causing the forwarding behavior to be different from that of the original user traffic. Thus, the measurement protocol may cause data traffic to be forwarded differently than without the measurement. Furthermore, fate-sharing is not guaranteed if in-situ measurement is applied only to a subset of the traffic.

Measurement inaccuracy may also be affected by other factors.
One factor may be security aspects; attackers may maliciously tamper with network measurements in order to create a false illusion of a network problem or to hide the existence of one [26]. Another factor that may affect the measurement accuracy in wide area networks is net neutrality; some service providers have been known to detect and assign higher priority to speed test traffic [27]. Generally speaking, measurements may be affected by intermediate nodes that do not ‘play nice’. These considerations and other aspects of net neutrality have not been the focus of the current paper, and are worthy of consideration.

7 RELATED WORK

The uncertainty principle and the observer effect have been widely discussed in the literature (e.g., [4, 28]). In computer science, the observer effect was analyzed in the context of computing performance [29], but to the best of our knowledge the current paper is the first to consider and analyze the observer effect in the context of communication networks. The literature is rich with work about network measurement methods: passive (e.g., [30–33]), active (e.g., [15, 16, 33, 34]), and in-situ measurement [6–9].

The fact that network measurement incurs overhead is common knowledge (see [12, 33, 35–37]), as is the fact that there is a tradeoff between a measurement’s accuracy and its impact [10, 27, 38]. We are, however, unaware of prior formal and quantitative analyses of the reciprocal relation between the measurement overhead and the measurement accuracy (and, in particular, in analogy to principles from quantum mechanics).

8 CONCLUSION

We presented and formalized an observer effect for computer networks, which captures the interplay between network performance and its measurement. The observer effect yields a delicate tradeoff between the required measurement granularity and the imposed performance impact, affecting both the networking and computing resources.

We believe that the cost efficiency of any network measurement and monitoring method should be evaluated with this granularity-impact tradeoff in mind. We view our network observer effect/factor framework as a first step towards better understanding and theoretical evaluation of efficient network measurement.
As networks continue to grow in size, requiring increasingly high visibility and transparency, the observer effect provides a way to formally capture and analyze how measurement impacts performance.

REFERENCES

[1] T. Mizrahi, M. Schapira, and Y. Moses, “The observer effect in computer networks,” ACM Applied Networking Research Workshop (ANRW), 2024.
[2] D. Frost and S. Bryant, “Packet Loss and Delay Measurement for MPLS Networks.” RFC 6374, 2011.
[3] ITU-T G.8113.1/Y.1372.1, “Operations, administration and maintenance mechanisms for MPLS-TP in packet transport networks,” 2016.
[4] W. Heisenberg, The physical principles of the quantum theory. 1930.
[5] D. Firestone, “Hardware-accelerated networks at scale in the cloud,” in ACM SIGCOMM 2017 Workshop on Kernel-Bypass Networks, Keynote, 2017.
[6] F. Brockners, S. Bhandari, and T. Mizrahi, “Data Fields for In Situ Operations, Administration, and Maintenance (IOAM),” RFC 9197, IETF, 2022.
[7] P4 Consortium, “In-band network telemetry (INT),” technical specification, 2016.
[8] J. Kumar, S. Anubolu, J. Lemon, R. Manur, H. Holbrook, A. Ghanwani, D. Cai, H. Ou, and L. Yizhou, “Inband Flow Analyzer,” Internet-Draft draft-kumar-ippm-ifa-01, Internet Engineering Task Force, Feb. 2019. Work in Progress.
[9] C. Kim, A. Sivaraman, N. Katta, A. Bas, A. Dixit, and L. J. Wobker, “In-band network telemetry via programmable dataplanes,” in ACM SIGCOMM Symposium on SDN Research (SOSR), 2015.
[10] Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao, M. Zhang, F. Kelly, M. Alizadeh, et al., “HPCC: high precision congestion control,” in ACM SIGCOMM, pp. 44–58, 2019.
[11] Metro Ethernet Forum, “Ethernet services attributes - phase 3,” MEF 10.3, 2013.
[12] A. Morton, “Active and Passive Metrics and Methods (with Hybrid Types In-Between).” RFC 7799, May 2016.
[13] “gRPC Network Management Interface (gNMI),” https://github.com/openconfig/reference/blob/master/rpc/gnmi/gnmi-specification.md.
[14] “Speedtest by Ookla,” http://www.speedtest.net/.
[15] “Connectivity Fault Management,” IEEE Std 802.1ag, 2007.
[16] ITU-T G.8013/Y.1731, “Operations, administration and maintenance (OAM) functions and mechanisms for Ethernet-based networks,” 2015.
[17] H. Song, B. Gafni, F. Brockners, S. Bhandari, and T. Mizrahi, “In Situ Operations, Administration, and Maintenance (IOAM) Direct Exporting,” RFC 9326, IETF, 2022.
[18] P4 Consortium, “Telemetry report format,” technical specification, 2018.
[19] “Observer Effect repository,” https://github.com/talmi/Obs.
[20] Agilent Technologies, “The journal of internet test methodologies,” https://web.archive.org/web/20130127153505/http://www.ixiacom.com/pdfs/test_plans/agilent_journal_of_internet_test_methodologies.pdf, 2007.
[21] “Stratum,” https://github.com/stratum/stratum.
[22] “802.1ag utilities,” https://github.com/vnrick/dot1ag-utils.
[23] “Implementation of IOAM for IPv6 in the Linux Kernel,” https://github.com/IurmanJ/kernel_ipv6_ioam.
[24] “Ericsson SPO 1400 Family,” tech. rep., 2012.
[25] “Huawei OptiX OSN 550 and OSN 3500,” tech. rep., 2011.
[26] T. Mizrahi, N. Sprecher, E. Bellagamba, and Y. Weingarten, “An Overview of Operations, Administration, and Maintenance (OAM) Tools.” RFC 7276, 2014.
[27] M. Dischinger, M. Marcon, S. Guha, P. K. Gummadi, R. Mahajan, and S. Saroiu, “Glasnost: Enabling end users to detect traffic differentiation,” in NSDI, pp. 405–418, 2010.
[28] W. Heisenberg and F. S. C. Northrop, Physics and philosophy: The revolution in modern science. 1958.
[29] T. Mytkowicz, P. F. Sweeney, M. Hauswirth, and A. Diwan, “Observer effect and measurement bias in performance analysis,” Computer Science Technical Reports CU-CS-1042-08, University of Colorado, Boulder, 2008.
[30] Cisco Systems, “Configuring netflow and netflow data export,” https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/netflow/configuration/xe-3s/nf-xe-3s-book/cfg-nflow-data-expt-xe.pdf, 2012.
[31] P. Aitken, B. Claise, and B. Trammell, “Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information.” RFC 7011, Sept. 2013.
[32] C. Yu, C. Lumezanu, Y. Zhang, V. Singh, G. Jiang, and H. V. Madhyastha, “Flowsense: Monitoring network utilization with zero measurement cost,” in International Conference on Passive and Active Network Measurement, pp. 31–41, Springer, 2013.
[33] Y. Zhu, N. Kang, J. Cao, A. Greenberg, G. Lu, R. Mahajan, D. Maltz, L. Yuan, M. Zhang, B. Y. Zhao, et al., “Packet-level telemetry in large datacenter networks,” in ACM SIGCOMM Computer Communication Review (CCR), vol. 45, pp. 479–491, ACM, 2015.
[34] R. Mittal, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, D. Zats, et al., “TIMELY: RTT-based Congestion Control for the Datacenter,” in ACM SIGCOMM Computer Communication Review (CCR), vol. 45, pp. 537–550, ACM, 2015.
[35] H. Wang, K. S. Lee, E. Li, C. L. Lim, A. Tang, and H. Weatherspoon, “Timing is everything: Accurate, minimum overhead, available bandwidth estimation in high-speed wired networks,” in Proceedings of the 2014 Conference on Internet Measurement Conference, pp. 407–420, 2014.
[36] B. Eriksson, P. Barford, and R. Nowak, “Network discovery from passive measurements,” in ACM SIGCOMM, pp. 291–302, 2008.
[37] D. A. Popescu and A. W. Moore, “Ptpmesh: Data center network latency measurements using ptp,” in 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 73–79, IEEE, 2017.
[38] A. Soule, A. Lakhina, N. Taft, K. Papagiannaki, K. Salamatian, A. Nucci, M. Crovella, and C. Diot, “Traffic matrices: balancing measurements, inference and modeling,” in Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pp. 362–373, 2005.